James Thomas, Josh Howkins
February 2023
Vulnerability measurement through index aggregation followed the hierarchical methodology described by Tate and used in the English Indices of Multiple Deprivation (IMD) technical report.
Metrics were selected from the literature, considered for inclusion by an expert panel, and translated into data reliably available at Lower Layer Super Output Area (LSOA) level. Census data was used predominantly; however, if no census-derived proxy was available, other sources such as the IMD were considered. If data was not published down to LSOA level, data from the next administrative level up, Middle Layer Super Output Area (MSOA), was used instead.
A total of 33 metrics were then selected for inclusion in the index, with the data analysis described below, and grouped into nine sub-domains and four domains:
adaptive capacity,
health,
sensitivity,
living environment.
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import shapely
from sklearn.decomposition import FactorAnalysis
import utils
%matplotlib inline
utils.set_plotting_defaults()
This notebook is stage 6 of the analysis, taking data pre-processed in a Python pipeline, provided in the code/ directory.
In the previous stages 1-5 of the analysis:
Raw data was downloaded from the internet.
Data for the area of interest (relating to South Gloucestershire) was extracted from zip archives and Shapefiles.
Data was joined into one file for each geography level (LSOA, MSOA).
Metrics (a.k.a. indicators) were calculated for each area (LSOA, MSOA) using the formulae defined in data_sources.yaml.
Individual metrics measuring different aspects or facets of the same underlying element of vulnerability were first added together if this could be done without overlap or double counting. For example, in the education metric, the categories "No qualifications", "Level 1" and "Level 2" could simply be added as there is no overlap between them. In the case of the income metric from IMD data, this summation had already been performed. Shrinkage was not applied.
The majority of combined metrics were counts of people or households in each LSOA. To allow LSOAs to be compared fairly, these were converted to percentages using an appropriate denominator that related to the total number of people or households 'at risk'. For example, for "No cars or vans in household" the appropriate denominator was the total number of households in the LSOA, whereas for the education metric it was the number of residents aged 16 years and over. The only metric that was not converted to a percentage was income, which was provided as a rank.
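These two steps can be sketched as follows; the column names and counts here are purely illustrative, not the real census values:

```python
import pandas as pd

# Hypothetical census-style counts per LSOA (names and values illustrative only)
df = pd.DataFrame({
    'no_qualifications': [120, 80],
    'level_1': [200, 150],
    'level_2': [300, 260],
    'residents_16_and_over': [1550, 1220],
})

# Mutually exclusive categories can be summed without double counting
combined = df['no_qualifications'] + df['level_1'] + df['level_2']

# Convert to a percentage of the 'at risk' denominator so LSOAs compare fairly
education_pc = 100 * combined / df['residents_16_and_over']
```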
All metrics were joined into one file, converting to 2021 LSOAs.
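The 2011-to-2021 conversion can be sketched as a merge on the ONS lookup table. This is a simplification that ignores LSOAs that split or merged between censuses (which require aggregation); the data values below are made up, though the column-name convention matches the LSOA21CD naming used elsewhere in this notebook:

```python
import pandas as pd

# Hypothetical rows from the ONS 2011 -> 2021 LSOA lookup
lookup = pd.DataFrame({
    'LSOA11CD': ['E01000001', 'E01000002'],
    'LSOA21CD': ['E01000001', 'E01009999'],  # second LSOA was recoded in 2021
})
metrics_2011 = pd.DataFrame({
    'LSOA11CD': ['E01000001', 'E01000002'],
    'some_metric__pc': [12.5, 30.0],
})

# Attach the 2021 codes and drop the 2011 ones
metrics_2021 = (
    metrics_2011
    .merge(lookup, on='LSOA11CD')
    .drop(columns='LSOA11CD')
)
```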
The raw data downloaded in stage 1 was:
Census 2021 data - TS001, 006, 007, 008, 017, 029, 037, 038, 039, 044, 045, 054, 069 - sourced from the Office for National Statistics.
English indices of deprivation data - IMD domains and underlying indicators - sourced from the Ministry of Housing, Communities & Local Government.
Office for National Statistics geospatial data - LSOA and MSOA extents, LSOA to MSOA lookup, LSOA 2011 to 2021 lookup - sourced from the Office for National Statistics.
indicators = pd.read_csv('../data/05_resampled/indicators.csv')
lsoas = gpd.read_file('../data/02_subset/lsoa_extents.shp')
msoas = gpd.read_file('../data/02_subset/msoa_extents.shp')
indicators = (
lsoas[['LSOA21CD', 'geometry']]
.merge(indicators, left_on='LSOA21CD', right_on='lsoa21cd')
.drop(columns='LSOA21CD')
)
print(indicators.columns)
Index(['geometry', 'lsoa21cd', 'lsoa21nm', 'sex__female__pc',
'english_proficiency__low__pc', 'qualification__level_2_or_lower__pc',
'health__bad_or_very_bad__pc', 'disabled_or_long_term_condition__pc',
'unpaid_care__any__pc', 'population__1k_per_km2',
'accomodation__mobile_or_temporary__pc', 'cars__none_in_household__pc',
'tenure__rented_social_or_private__pc', 'household__living_alone__pc',
'age__under_5__pc', 'age__65_and_over__pc', 'income__rank',
'housing_poor__indicator'],
dtype='object')
to_process = indicators.drop(columns=['lsoa21cd', 'lsoa21nm', 'geometry'])
The distributions of and correlations between the metrics were examined.
Notes:
The metric mobile or temporary accommodation was highly skewed and was not included in the later analysis.
The metrics sex and population density were also not included in the later analysis.
def plot_distributions(data, n_columns=3):
data = pd.DataFrame(data)
n_rows = 1 + (len(data.columns) - 1) // n_columns
with plt.rc_context():
plt.rcParams['axes.titlesize'] = 10
axs = data.hist(bins=30, figsize=(3*n_columns, 1.5*n_rows), layout=(n_rows, n_columns), grid=False)
plt.tight_layout()
plot_distributions(to_process)
sns.pairplot(to_process)
Before being combined into sub-domains (or, in the case of health, the domain), the metrics were first placed on a common, dimensionless scale by ranking their values and transforming to a standard normal distribution based on those ranks.
def transform_to_normal(df):
from scipy.stats import norm
R = (df.rank() - 0.5) / len(df)
X = norm.ppf(R)
if type(df) is pd.DataFrame:
return pd.DataFrame(X, columns=R.columns)
else:
return pd.Series(X)
transformed = transform_to_normal(to_process)
plot_distributions(transformed)
sns.pairplot(transformed)
Before grouping the metrics into sub-domains, an exploratory factor analysis with 'varimax rotation' was conducted to see what groups might naturally fall out of the data.
def search_for_number_of_components(data, possible_components=range(1, 6), title=None):
from sklearn.model_selection import cross_val_score
fa_model = FactorAnalysis()
scores = []
for n in possible_components:
if n > len(data.columns):
# Not enough columns for this number of components, so skip
continue
fa_model.n_components = n
scores.append(cross_val_score(fa_model, data).mean())
plt.figure(figsize=(0.5*len(scores), 4))
plt.plot(possible_components[:len(scores)], scores, '.-')
plt.xlabel('n_components')
plt.ylabel('score')
best_score = np.max(scores)
chosen_idx = np.argmax(scores)
plt.axhline(best_score, color='k')
plt.axvline(possible_components[chosen_idx], ls='--', color='k')
if title:
plt.title(title)
trial_factors = [
#'sex__female__pc',
'english_proficiency__low__pc',
'qualification__level_2_or_lower__pc',
'health__bad_or_very_bad__pc',
'disabled_or_long_term_condition__pc',
'unpaid_care__any__pc',
'population__1k_per_km2',
#'accomodation__mobile_or_temporary__pc',
'cars__none_in_household__pc',
'tenure__rented_social_or_private__pc',
'household__living_alone__pc',
'age__under_5__pc',
'age__65_and_over__pc',
'income__rank',
'housing_poor__indicator',
]
to_fit = transformed[trial_factors]
search_for_number_of_components(to_fit, possible_components=range(1,15))
A higher cross-validation score indicates a potentially better model.
There are possibly 4, 5 or 6 components.
Examining a model with up to 6 components:
def plot_loadings(data, n_components, title=None):
if n_components > len(data.columns):
# Not enough columns for this number of components, so do nothing
return
fa_model = FactorAnalysis(n_components=n_components, rotation='varimax')
fa_model.fit(data)
fig = plt.figure(figsize=(0.5 * min(n_components, len(data.columns)), 0.5 * len(data.columns)))
ax = fig.add_axes((0, 0, 1, 1))
sns.heatmap(fa_model.components_.T, cmap='RdBu_r', vmin=-1, vmax=1, annot=True, fmt='0.2f', ax=ax, yticklabels=data.columns, cbar=False)
if title:
plt.title(title)
plot_loadings(to_fit, n_components=6)
The loadings are the weightings that would be applied to each metric within a component.
Usually:
Metrics with a small weighting (e.g. lower than 0.4 in magnitude) are not included within that component.
Metrics are only included in one component - if a metric is heavily loaded on more than one component, other studies often exclude it from the final calculation, citing a complex relationship.
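These two rules can be sketched as below; the loadings matrix is made up purely to illustrate the selection logic, not taken from the fitted model:

```python
import pandas as pd

# Hypothetical loadings (metrics x components); values are illustrative only
loadings = pd.DataFrame(
    [[0.82, 0.10],    # strongly loaded on one component -> keep
     [0.35, 0.20],    # no loading above 0.4 in magnitude -> exclude
     [0.55, -0.61]],  # heavily loaded on two components -> exclude
    index=['metric_a', 'metric_b', 'metric_c'],
    columns=['component_1', 'component_2'],
)

strong = loadings.abs() >= 0.4
keep = strong.sum(axis=1) == 1  # exactly one strong loading
assignment = loadings.abs().idxmax(axis=1).where(keep)
```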
The domain groupings suggested by expert panel were also tested with the same factor analysis.
factor_groups = {
'adaptive_capacity': [
'english_proficiency__low__pc',
'qualification__level_2_or_lower__pc',
'income__rank',
'unpaid_care__any__pc',
],
'health': [
'health__bad_or_very_bad__pc',
'disabled_or_long_term_condition__pc',
],
'sensitivity': [
'age__under_5__pc',
'age__65_and_over__pc',
],
'living_environment': [
'cars__none_in_household__pc',
'tenure__rented_social_or_private__pc',
'housing_poor__indicator',
'household__living_alone__pc',
],
}
for factor_name, trial_factors in factor_groups.items():
to_fit = transformed[trial_factors]
search_for_number_of_components(to_fit, title=factor_name)
plot_loadings(to_fit, n_components=1, title=factor_name)
plot_loadings(to_fit, n_components=2, title=factor_name)
plot_loadings(to_fit, n_components=3, title=factor_name)
plot_loadings(to_fit, n_components=4, title=factor_name)
Findings:
Adaptive capacity - it was suggested that english_proficiency__low__pc and unpaid_care__any__pc be placed in separate sub-domains because they do not represent the same underlying factor, and they are not well correlated:
to_process.plot.scatter(x='english_proficiency__low__pc', y='unpaid_care__any__pc')
Health - 1 component indicated, so it is suggested to combine the metrics directly into the domain, with no separate sub-domains.
Sensitivity - 1 component indicated; however, the loadings are in opposite directions, which may be due to correlation between the variables. Both these populations (<5 and ≥65 years old) are likely to be vulnerable to heat in potentially different ways, so it is suggested they be included in separate sub-domains.
Living environment - 1 component indicated; however, it is suggested to use 2 components (sub-domains) because housing_poor__indicator and tenure__rented_social_or_private__pc are not well correlated:
to_process.plot.scatter(x='housing_poor__indicator', y='tenure__rented_social_or_private__pc')
The groupings are now as below.
Where there is more than one metric per sub-domain (or, in the case of health, domain), a factor analysis was used to check whether the metrics should be added, or whether one should be subtracted from the other, to indicate vulnerability.
If necessary, the sign is also flipped so that a larger number indicates increased vulnerability.
Adaptive capacity, income:
qualification__level_2_or_lower__pc minus income__rank (as expected)
Health:
health__bad_or_very_bad__pc plus disabled_or_long_term_condition__pc
Living environment, short-term adaptation:
cars__none_in_household__pc plus household__living_alone__pc
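The direction check can be illustrated with a one-component factor analysis on synthetic data (a sketch only, not the notebook's pipeline): when two metrics load with opposite signs, one should be subtracted from the other, as with income__rank above.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 1))

# Two synthetic metrics driven by the same latent factor in opposite directions
data = np.hstack([
    latent + 0.1 * rng.normal(size=(500, 1)),
    -latent + 0.1 * rng.normal(size=(500, 1)),
])

fa = FactorAnalysis(n_components=1).fit(data)
signs = np.sign(fa.components_[0])
# Opposite-sign loadings indicate the second metric should be subtracted
# (equivalently, have its sign flipped) when combining
```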
factor_groups = {
'adaptive_capacity__income': [
'income__rank',
'qualification__level_2_or_lower__pc',
],
'adaptive_capacity__language': [
'english_proficiency__low__pc',
],
'adaptive_capacity__others': [
'unpaid_care__any__pc',
],
'health': [
'health__bad_or_very_bad__pc',
'disabled_or_long_term_condition__pc',
],
'sensitivity__younger': [
'age__under_5__pc',
],
'sensitivity__older': [
'age__65_and_over__pc',
],
'living_environment__condition': [
'housing_poor__indicator',
],
'living_environment__short_term_adaptation': [
'cars__none_in_household__pc',
'household__living_alone__pc',
],
'living_environment__longer_term_adaptation': [
'tenure__rented_social_or_private__pc',
],
}
for factor_name, trial_factors in factor_groups.items():
if len(trial_factors) < 2:
continue
to_fit = transformed[trial_factors]
search_for_number_of_components(to_fit, title=factor_name)
plot_loadings(to_fit, n_components=1, title=factor_name)
The normalised (transformed to normal distribution) metrics were then combined to produce sub-domain scores (or in the case of health, domain scores).
sub_domain_formulae = {
'adaptive_capacity__income': 'qualification__level_2_or_lower__pc - income__rank',
'adaptive_capacity__language': 'english_proficiency__low__pc',
'adaptive_capacity__others': 'unpaid_care__any__pc',
'health__health': 'health__bad_or_very_bad__pc + disabled_or_long_term_condition__pc',
'sensitivity__younger': 'age__under_5__pc',
'sensitivity__older': 'age__65_and_over__pc',
'living_environment__short_term_adaptation': 'household__living_alone__pc + cars__none_in_household__pc',
'living_environment__longer_term_adaptation': 'tenure__rented_social_or_private__pc',
'living_environment__condition': 'housing_poor__indicator',
}
sub_domain_scores = pd.DataFrame({
sub_domain: transformed.eval(formula)
for sub_domain, formula in sub_domain_formulae.items()
})
plot_distributions(sub_domain_scores)
The sub-domain scores were then converted to a truncated exponential distribution according to Appendix F of the IMD technical report, by ranking and then using the equation:
$$ X = -23 \ln\left(1 - R\left(1 - e^{-100/23}\right)\right) $$

where $X$ is the converted sub-domain score and $R$ is the rank, scaled to the range $(0, 1]$.
The conversion to a truncated exponential was performed so that if an LSOA scored highly in one sub-domain (or domain), this could not be cancelled out by a low score in another. As such, vulnerability in one sub-domain generally rendered the LSOA as vulnerable overall.
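As a quick numerical check (assuming scipy is available, as elsewhere in this notebook), the closed-form equation is equivalent to scaling scipy's truncexpon.ppf with shape parameter b = 100/23:

```python
import numpy as np
from scipy.stats import truncexpon

d = 23
R = np.linspace(0.01, 1.0, 50)  # ranks scaled to (0, 1]

# Closed-form equation from the IMD technical report
closed_form = -d * np.log(1 - R * (1 - np.exp(-100 / d)))

# Equivalent formulation via scipy's truncated exponential quantile function
scipy_form = d * truncexpon.ppf(R, b=100 / d)
```

At R = 1 both forms give the maximum score of 100.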
def transform_to_truncated_exponential(df, d=23):
from scipy.stats import truncexpon
R = df.rank() / len(df)
#X = -d * np.log(1 - R*(1 - np.exp(-100/d)))
X = d * truncexpon.ppf(R, b=100/d)
if type(df) is pd.DataFrame:
return pd.DataFrame(X, columns=R.columns)
else:
return pd.Series(X)
sub_domain_scores_transformed = transform_to_truncated_exponential(sub_domain_scores)
plot_distributions(sub_domain_scores_transformed)
The sub-domains were summed with equal weighting to yield the domain scores.
domain_formulae = {
'adaptive_capacity': 'adaptive_capacity__income + adaptive_capacity__language + adaptive_capacity__others',
'health': 'health__health',
'sensitivity': 'sensitivity__younger + sensitivity__older',
'living_environment': 'living_environment__short_term_adaptation + living_environment__longer_term_adaptation + living_environment__condition',
}
domain_scores = pd.DataFrame({
domain: sub_domain_scores_transformed.eval(formula)
for domain, formula in domain_formulae.items()
})
plot_distributions(domain_scores)
The domain scores were again converted to a truncated exponential distribution to reduce cancellation effects.
The domain scores were then summed with equal weighting (in the absence of a strong justification to apply weighting) to yield the overall vulnerability index.
domain_scores_transformed = transform_to_truncated_exponential(domain_scores)
plot_distributions(domain_scores_transformed)
index_formula = 'adaptive_capacity + health + sensitivity + living_environment'
index_score = domain_scores_transformed.eval(index_formula)
plot_distributions(index_score)
The overall vulnerability index was also ranked and converted to percentiles and deciles for reporting.
index = pd.DataFrame({
'score': index_score,
'rank': index_score.rank().astype(int),
'percentile': 1 + pd.qcut(index_score, 100, labels=False),
'decile': 1 + pd.qcut(index_score, 10, labels=False),
})
initial_columns = ['lsoa21cd', 'lsoa21nm']
index = pd.concat([
indicators[initial_columns],
index,
domain_scores_transformed,
sub_domain_scores_transformed,
indicators.drop(columns=initial_columns),
], axis=1)
index = gpd.GeoDataFrame(index)
index.explore('score', tooltip_kwds={'localize': True})
index.explore('rank', tooltip_kwds={'localize': True})
index.explore('decile', tooltip_kwds={'localize': True})
index.drop(columns='geometry').to_csv('../data/06_index/index.csv', index=False, float_format='%0.3f')
index.to_file('../data/06_index/index.geojson', index=False, float_format='%0.3f')
index.explore(
column='decile',
tooltip=list(index.columns)[:9],
tooltip_kwds={'localize': True},
popup=True,
).save('../data/06_index/index.html')
The code below this point was used during the development of the methodology and was not included in the final analysis.
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(class_0, class_1):
cm = confusion_matrix(class_0, class_1)
cm_diagonal = np.where(np.eye(*cm.shape), cm, np.nan)
cm_non_diagonal = np.where(np.eye(*cm.shape), np.nan, cm)
axis_labels = range(1, len(cm) + 1)
sns.heatmap(
cm,
xticklabels=axis_labels,
yticklabels=axis_labels,
annot=True,
square=True,
vmin=0,
cbar=False,
).invert_yaxis()
#sns.heatmap(
# cm_diagonal,
# xticklabels=axis_labels,
# yticklabels=axis_labels,
# annot=True,
# square=True,
# cmap='RdYlGn',
# vmin=0,
# cbar=False,
#).invert_yaxis()
#sns.heatmap(
# cm_non_diagonal,
# xticklabels=axis_labels,
# yticklabels=axis_labels,
# annot=True,
# square=True,
# cmap='RdYlGn_r',
# vmin=0,
# cbar=False,
#).invert_yaxis()
plt.show()
exp = transform_to_truncated_exponential
norm = transform_to_normal
score_0 = (
exp(
exp(norm(indicators['age__65_and_over__pc']))
+ exp(norm(indicators['age__under_5__pc']) + norm(indicators['english_proficiency__low__pc']))
+ exp(norm(indicators['qualification__level_2_or_lower__pc']) - norm(indicators['income__rank']))
)
+ exp(
exp(norm(indicators['health__bad_or_very_bad__pc']) + norm(indicators['disabled_or_long_term_condition__pc']))
+ exp(norm(indicators['unpaid_care__any__pc']))
)
+ exp(
exp(norm(indicators['household__living_alone__pc']) + norm(indicators['cars__none_in_household__pc']))
+ exp(norm(indicators['tenure__rented_social_or_private__pc']) + norm(indicators['housing_poor__indicator']))
)
)
score_0_decile = pd.qcut(score_0, 10, labels=False)
score_1 = (
exp(
exp(norm(indicators['age__65_and_over__pc']))
+ exp(norm(indicators['age__under_5__pc']))
+ exp(norm(indicators['qualification__level_2_or_lower__pc']) - norm(indicators['income__rank']))
)
+ exp(
exp(norm(indicators['health__bad_or_very_bad__pc']) + norm(indicators['disabled_or_long_term_condition__pc']))
+ exp(norm(indicators['unpaid_care__any__pc']))
)
+ exp(
exp(norm(indicators['household__living_alone__pc']) + norm(indicators['cars__none_in_household__pc']))
+ exp(norm(indicators['tenure__rented_social_or_private__pc']) + norm(indicators['housing_poor__indicator']))
)
)
score_1_decile = pd.qcut(score_1, 10, labels=False)
plt.scatter(score_0, score_1)
plt.axis('square')
plt.show()
plot_confusion_matrix(score_0_decile, score_1_decile)
print('rms difference in scores:', np.sqrt(np.mean((score_0 - score_1)**2)))
print('rms difference in deciles:', np.sqrt(np.mean((score_0_decile - score_1_decile)**2)))
rms difference in scores: 9.11095873334761
rms difference in deciles: 0.8618916073713346
score_1 = (
exp(
exp(norm(indicators['age__65_and_over__pc']))
+ exp(norm(indicators['age__under_5__pc']))
+ exp(-norm(indicators['income__rank']))
)
+ exp(
exp(norm(indicators['health__bad_or_very_bad__pc']))
+ exp(norm(indicators['unpaid_care__any__pc']))
)
+ exp(
exp(norm(indicators['household__living_alone__pc']) + norm(indicators['cars__none_in_household__pc']))
+ exp(norm(indicators['tenure__rented_social_or_private__pc']) + norm(indicators['housing_poor__indicator']))
)
)
score_1_decile = pd.qcut(score_1, 10, labels=False)
plt.scatter(score_0, score_1)
plt.axis('square')
plt.show()
plot_confusion_matrix(score_0_decile, score_1_decile)
print('rms difference in scores:', np.sqrt(np.mean((score_0 - score_1)**2)))
print('rms difference in deciles:', np.sqrt(np.mean((score_0_decile - score_1_decile)**2)))
rms difference in scores: 12.035313857308715
rms difference in deciles: 0.9501879513323366
score_1 = (
# Adaptive capacity
exp(
exp(norm(indicators['qualification__level_2_or_lower__pc']) - norm(indicators['income__rank']))
+ exp(norm(indicators['english_proficiency__low__pc']))
)
# Sensitivity
+ exp(
exp(norm(indicators['age__65_and_over__pc']))
+ exp(norm(indicators['age__under_5__pc']))
+ exp(norm(indicators['health__bad_or_very_bad__pc']) + norm(indicators['disabled_or_long_term_condition__pc']))
+ exp(norm(indicators['unpaid_care__any__pc']))
)
# Local environment
+ exp(
exp(norm(indicators['household__living_alone__pc']) + norm(indicators['cars__none_in_household__pc']))
+ exp(norm(indicators['tenure__rented_social_or_private__pc']) + norm(indicators['housing_poor__indicator']))
)
)
score_1_decile = pd.qcut(score_1, 10, labels=False)
plt.scatter(score_0, score_1)
plt.axis('square')
plt.show()
plot_confusion_matrix(score_0_decile, score_1_decile)
print('rms difference in scores:', np.sqrt(np.mean((score_0 - score_1)**2)))
print('rms difference in deciles:', np.sqrt(np.mean((score_0_decile - score_1_decile)**2)))
rms difference in scores: 11.680724083834782
rms difference in deciles: 0.931971796017148
i = index.copy()
i['tertile'] = 1 + pd.qcut(i['score'], 3, labels=False)
m = i.explore(
column='tertile',
cmap='Reds',
categorical=True,
legend_kwds={'labels': ['1', '2', '3']},
tooltip=['tertile'] + list(index.columns)[:9],
tooltip_kwds={'localize': True},
popup=True,
)
m.save('../data/06_index/index_tertile.html')
m
i = index.copy()
i['high'] = 1*((1 + pd.qcut(i['score'], 10, labels=False)) >= 9)
m = i.explore(
column='high',
cmap='Reds',
categorical=True,
tooltip=['high'] + list(index.columns)[:9],
tooltip_kwds={'localize': True},
popup=True,
)
m.save('../data/06_index/index_high.html')
m
i = index.copy()
i['high_med'] = 1*((1 + pd.qcut(i['score'], 10, labels=False)) >= 9) + 1*((1 + pd.qcut(i['score'], 10, labels=False)) >= 6)
m = i.explore(
column='high_med',
cmap='Reds',
categorical=True,
tooltip=['high_med'] + list(index.columns)[:9],
tooltip_kwds={'localize': True},
popup=True,
)
m.save('../data/06_index/index_high_med.html')
m